Skip to content

Fix broken WSLCorePort channel after receive timeout#14455

Merged
chemwolf6922 merged 26 commits intomicrosoft:masterfrom
chemwolf6922:fix-broken-state-after-transaction-timeout
Apr 21, 2026
Merged

Fix broken WSLCorePort channel after receive timeout#14455
chemwolf6922 merged 26 commits intomicrosoft:masterfrom
chemwolf6922:fix-broken-state-after-transaction-timeout

Conversation

@chemwolf6922
Copy link
Copy Markdown
Contributor

@chemwolf6922 chemwolf6922 commented Mar 17, 2026

Summary of the Pull Request

This pattern shows up in multiple sleep -> wake -> wsl stuck reports:
image

In the current sequence number logic, the receive sequence will ++ without receiving any message. If timeout is allowed on the channel and it's not destroyed, the next receive will always get the N-1 message due to:

  1. The stale message arrived after the timeout. Or,
  2. The reply end never received the request, so it's send counter is N-1.

This will lock the channel in an unusable state.

This PR makes these changes to keep the WSLCorePort channels working after a receive timeout.

  1. Replace the sequence number with transaction id. And add transaction step in the message header.
  2. The transaction and non-transaction messages are differentiated by the transaction step.
  3. The old sequence number logic stays the same for non-transaction messages.
  4. For transaction messages. The reply side will reply with the same id in the incoming transaction request.
  5. For transaction messages. The receive loop will discard stale messages based on the transaction id.

I tried my best to update all the applicable receive and sender ends to use the transaction logic. But please help double check I'm not missing anything. Since a mismatch will cause the transaction to always fail.

PR Checklist

Detailed Description of the Pull Request / Additional comments

Validation Steps Performed

Failed unit tests


UnitTests::UnitTests::Warnings [Failed]
Error: Verify: AreEqual(expectedWarnings, warnings) - Values (, wsl: Due to a current compatibility issue with Global Secure Access Client, DNS Tunneling is disabled.
) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::Warnings::<lambda_1>::operator (), Line: 2022]

UnitTests::UnitTests::KernelModules [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D9868B9) Exception(57) tid(8178) 8000FFFF Catastrophic failure
    Msg:[Command "C:\WINDOWS\system32\wsl.exe echo ok"returned unexpected exit code (0 != -1). Stdout: 'ok
'Stderr: 'wsl: The .wslconfig setting 'wsl2.kernel' is disabled by the computer policy.
'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::VersionFlavorParsing [Failed]
WSL1 is disabled by the computer policy.
Please run 'wsl.exe --set-version tmpdistro 2' to upgrade to WSL2.
Error code: Wsl/Service/CreateInstance/WSL_E_WSL1_DISABLED
Error: Verify: AreEqual(LxsstuLaunchWsl(std::format(L"-d {} cat /etc/os-release || true", Distro).c_str()), 0L) - Values (4294967295, 0) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::VersionFlavorParsing::<lambda_1>::operator (), Line: 3924]

UnitTests::UnitTests::CaseSensitivity [Failed]
Verify: IsFalse(getCaseSensitivity(std::format(L"{}/l1/l2/l3-other", testDir)))
Error: Caught std::exception: D:\workspace\WSL\src\windows\common\filesystem.cpp(339)\wsltests.dll!00007FFD0D2E0169: (caller: 00007FFD0D07A634) Exception(11) tid(1a18) C0000101     [`anonymous-namespace'::EnsureCaseSensitiveDirectoryRecursive::<lambda_1>::operator ()(NtSetInformationFile(Directory, &IoStatus, &CaseInfo, sizeof(CaseInfo), FileCaseSensitiveInformation))]

UnitTests::UnitTests::CustomModulesVhd [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D987DAF) Exception(1) tid(3f90) 8000FFFF Catastrophic failure
    Msg:[Command "Powershell -NoProfile -Command "$acl = Get-Acl 'D:\workspace\WSL\test-modules.vhd' ; $acl.RemoveAccessRuleAll((New-Object System.Security.AccessControl.FileSystemAccessRule(\"Everyone\", \"Read\", \"None\", \"None\", \"Allow\"))); Set-Acl -Path 'D:\workspace\WSL\test-modules.vhd' -AclObject $acl""returned unexpected exit code (1 != 0). Stdout: ''Stderr: 'Get-Acl : The 'Get-Acl' command was found in the module 'Microsoft.PowerShell.Security', but the module could not be loaded. For more informa
tion, run 'Import-Module Microsoft.PowerShell.Security'.
At line:1 char:8
+ $acl = Get-Acl 'D:\workspace\WSL\test-modules.vhd' ; $acl.RemoveAcces ...
+        ~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Get-Acl:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule

You cannot call a method on a null-valued expression.
At line:1 char:54
+ ... ules.vhd' ; $acl.RemoveAccessRuleAll((New-Object System.Security.Acce ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvokeMethodOnNull

Set-Acl : The 'Set-Acl' command was found in the module 'Microsoft.PowerShell.Security', but the module could not be loaded. For more informa
tion, run 'Import-Module Microsoft.PowerShell.Security'.
At line:1 char:190
+ ... sRule("Everyone", "Read", "None", "None", "Allow"))); Set-Acl -Path ' ...
+                                                           ~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Set-Acl:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule

'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::BrokenDistroImport [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D987DAF) Exception(1) tid(13dc8) 8000FFFF Catastrophic failure
    Msg:[Command "Powershell -NoProfile -Command "New-Vhd EmptyVhd.vhdx  -SizeBytes 20MB""returned unexpected exit code (1 != 0). Stdout: ''Stderr: 'New-Vhd : Failed to create the virtual hard disk.
The system failed to create 'D:\workspace\WSL\EmptyVhd.vhdx'.
Failed to create the virtual hard disk.
The system failed to create 'D:\workspace\WSL\EmptyVhd.vhdx': The file exists. (0x80070050).
At line:1 char:1
+ New-Vhd EmptyVhd.vhdx  -SizeBytes 20MB
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [New-VHD], VirtualizationException
    + FullyQualifiedErrorId : OperationFailed,Microsoft.Vhd.PowerShell.Cmdlets.NewVhd

'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::WslDebug [Failed]
wsl: The .wslconfig setting 'wsl2.kernelCommandLine' is disabled by the computer policy.
Error: Verify: AreEqual(LxsstuLaunchWsl(L"dmesg | grep -iF 'vmbus_send_tl_connect_request'"), 0L) - Values (1, 0) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::WslDebug, Line: 6425]

UnitTests::UnitTests::CGroupv1 [Failed]
Error: Verify: AreEqual(out, expected) - Values (/sys/fs/cgroup/unified cgroup2 cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate
, ) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::CGroupv1::<lambda_1>::operator (), Line: 6435]

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a broken channel state that occurs after a transaction timeout in WSL's socket-based IPC protocol. The issue (#14193, #14055) manifests after laptop sleep/hibernate, where a channel's expected sequence number gets desynchronized, causing all subsequent communication to fail until wsl --shutdown.

Changes:

  • Replace independent sender/receiver sequence counters with an echo-back mechanism: the responder echoes back the request's sequence number in its reply, preventing desync after timeouts.
  • Add a magic number field to MESSAGE_HEADER for early framing corruption detection, and skip stale (timed-out) replies in the receive loop.
  • Zero-initialize a Reply union in binfmt.cpp to ensure the new MessageMagic default initializer doesn't cause issues with raw read() calls.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/shared/inc/lxinitshared.h Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for new struct size.
src/shared/inc/SocketChannel.h Rewrote send/receive sequence logic to echo-back model; added stale message skipping loop; replaced m_received_messages with m_expected_reply_sequence / m_pending_reply_sequence.
src/shared/inc/socketshared.h Added magic number validation in RecvMessage before processing header.
src/linux/init/binfmt.cpp Zero-initialized Reply union to handle new MessageMagic default member initializer.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/shared/inc/SocketChannel.h Outdated
Comment thread src/shared/inc/SocketChannel.h Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 17, 2026 09:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a broken channel state issue in WSL's SocketChannel that occurs after a transaction timeout (e.g., when resuming from sleep). Previously, a timeout would increment the expected message ID on the receiver side, but the sender wouldn't use that incremented ID, causing a permanent ID desync and locking the channel. The fix replaces independent sequence tracking with an echo-back mechanism where the responder echoes back the request's sequence number in its reply, and the requester skips stale replies from previously timed-out transactions.

Changes:

  • Added a magic number field to MESSAGE_HEADER and validated it in RecvMessage to detect framing corruption early.
  • Replaced independent sequence counters with an echo-back sequence mechanism in SocketChannel using m_expected_reply_sequence and m_pending_reply_sequence, with a loop to skip stale replies.
  • Zero-initialized a union in binfmt.cpp to ensure the new MessageMagic field is properly initialized when reading responses.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/shared/inc/lxinitshared.h Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for LX_GNS_SET_PORT_LISTENER size.
src/shared/inc/socketshared.h Added magic number validation in RecvMessage after reading the header.
src/shared/inc/SocketChannel.h Replaced send/receive sequence tracking with echo-back mechanism; added stale-reply skip loop; removed sequence parameter from ValidateMessageHeader.
src/linux/init/binfmt.cpp Zero-initialized Reply union to ensure MessageMagic defaults correctly.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/shared/inc/SocketChannel.h Outdated
Feng Wang added 2 commits March 17, 2026 17:31
…om:chemwolf6922/WSL into fix-broken-state-after-transaction-timeout
Copilot AI review requested due to automatic review settings March 20, 2026 03:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent SocketChannel protocol desynchronization after a transaction timeout (the “expected sequence” advancing while a late response with the previous sequence arrives), which can leave a channel in a permanently broken state.

Changes:

  • Reworks SocketChannel sequencing to echo request sequence numbers in replies and discard stale replies.
  • Adds new per-channel state (m_expected_reply_sequence / m_pending_reply_sequence) to track request/reply sequencing.
  • Updates protocol error handling to validate type separately from sequencing.

Comment thread src/shared/inc/SocketChannel.h Outdated
Comment thread src/shared/inc/SocketChannel.h Outdated
Comment thread src/shared/inc/SocketChannel.h Outdated
Comment thread src/shared/inc/SocketChannel.h Outdated
Copilot AI review requested due to automatic review settings March 20, 2026 05:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread src/shared/inc/SocketChannel.h Outdated
Comment thread src/shared/inc/SocketChannel.h
Comment thread src/shared/inc/SocketChannel.h Outdated
Copilot AI review requested due to automatic review settings March 20, 2026 07:50
Copilot AI review requested due to automatic review settings April 2, 2026 09:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (4)

src/linux/init/config.cpp:356

  • The function signature now takes a Transaction& for sending replies, but the argument documentation still refers to ResponseChannel. Please update the comment block so the parameter names/descriptions match the new signature (e.g., describe that replies must be sent on the provided transaction).
void ConfigHandleInteropMessage(
    wsl::shared::Transaction& Transaction,
    wsl::shared::SocketChannel& InteropChannel,
    bool Elevated,
    gsl::span<gsl::byte> Message,
    const MESSAGE_HEADER* Header,
    const wsl::linux::WslDistributionConfig& Config)

/*++

Routine Description:

    This routine handles a message received from a Linux client using init's
    interop socket.

Arguments:

    ResponseChannel - Supplies channel used to send responses.

    InteropChannel - Supplies a channel to the host to be used for create
        process requests.

src/linux/init/config.cpp:637

  • The argument documentation still says MessageFd is used to send the response, but the function now takes a SendResponse callback. Please update the comment to reflect the new response mechanism to avoid confusion for future maintenance.
int ConfigInitializeInstance(const std::function<void(gsl::span<gsl::byte>&)>& SendResponse, gsl::span<gsl::byte> Buffer, wsl::linux::WslDistributionConfig& Config)

/*++

Routine Description:

    This routine initializes the instance's externally controlled state, which
    is received from the service.

    N.B. When setting these values errors are treated as non-fatal to account
         for unexpected distro state.

Arguments:

    MessageFd - Supplies a file descriptor to send the response message.

    Buffer - Supplies the message buffer.

src/linux/init/init.cpp:2687

  • The doc comment still refers to a Channel parameter for sending the terminate response, but the signature now takes a SendResult callback. Please update the comment block to match the new parameters.
void InitTerminateInstance(gsl::span<gsl::byte> Buffer, const std::function<void(bool)>& SendResult, wsl::linux::WslDistributionConfig& Config)

/*++

Routine Description:

    This routine processes a terminate instance request from the service.

Arguments:

    Buffer - Supplies the message buffer.

    Channel - Supplies a channel to send the response.

src/linux/init/main.cpp:3179

  • The ProcessMessage signature now takes a wsl::shared::Transaction&, but the argument comment block still describes a MessageFd/file descriptor used to send responses. Please update the documentation to reflect that replies should be sent via the provided transaction (or its underlying channel) rather than a raw fd.
int ProcessMessage(wsl::shared::Transaction& Transaction, LX_MESSAGE_TYPE Type, gsl::span<gsl::byte> Buffer, VmConfiguration& Config)

/*++

Routine Description:

    This routine processes messages from the service.

Arguments:

    MessageFd - Supplies a file descriptor to the socket on which the message was
        received. This is used for operations that require responses, for example a
        VHD eject request.

Comment thread src/linux/init/init.cpp Outdated
Comment thread src/shared/inc/SocketChannel.h Outdated
Comment thread src/shared/inc/SocketChannel.h Outdated
Comment thread src/windows/service/exe/LxssCreateProcess.h Outdated
@chemwolf6922
Copy link
Copy Markdown
Contributor Author

Hi @OneBlue . I updated the design to use transaction id + transaction step to better identify different messages. Please help review. Please also help verify if I updated all the correct pairs. Thanks.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Comment thread src/linux/init/GnsEngine.cpp
Comment thread src/linux/init/config.h Outdated
@chemwolf6922
Copy link
Copy Markdown
Contributor Author

Hi @benhillis, I resolved the copilot comments. Could you please help take another look? Thanks.

Copilot AI review requested due to automatic review settings April 16, 2026 06:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

Comment thread src/shared/inc/SocketChannel.h
Comment thread src/shared/inc/SocketChannel.h
Comment thread src/shared/inc/lxinitshared.h
Copy link
Copy Markdown
Collaborator

@OneBlue OneBlue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, very clean and should help us catch issues & recover from socket timeouts.

Couple minor comments, but I think we can merge this soon. Thank you for doing this !

Comment thread src/linux/init/config.cpp
Comment thread src/linux/init/GnsEngine.cpp
Comment thread src/linux/init/init.cpp
Comment thread src/shared/inc/SocketChannel.h
@OneBlue OneBlue self-requested a review April 16, 2026 21:05
Copilot AI review requested due to automatic review settings April 17, 2026 04:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.

Comment thread src/shared/inc/SocketChannel.h
@chemwolf6922 chemwolf6922 merged commit f05690c into microsoft:master Apr 21, 2026
7 checks passed
@chemwolf6922 chemwolf6922 deleted the fix-broken-state-after-transaction-timeout branch April 21, 2026 02:14
shuaiyuanxx added a commit that referenced this pull request Apr 21, 2026
Resolve conflicts introduced by #14455 (transaction-based SocketChannel
refactor) against the TraceLoggingActivity rework:

- WslCoreInstance.cpp WaitForInitConfigResponse: keep the narrowed
  activity scope; switch the receive from
  m_initChannel->GetChannel().ReceiveMessage() to transaction.Receive()
  using the transaction that master starts for the preceding Send.
- WslCoreVm.cpp SendLaunchInit: keep the activity scope and move the
  StartTransaction() + Send inside it. The transaction object is only
  used for this one-shot send (no matching receive on the same id
  within this function), so scoping it to the activity block preserves
  master's semantics.
benhillis pushed a commit that referenced this pull request Apr 21, 2026
Refactor the socket channel to resolve the broken state after a timeout:
* Replace the sequence number with transaction id. And add transaction step in the message header.
* The transaction and non-transaction messages are differentiated by the transaction step.
* The old sequence number logic stays the same for non-transaction messages.
* For transaction messages. The reply side will reply with the same id in the incoming transaction request.
* For transaction messages. The receive loop will discard stale messages based on the transaction id.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants